Anchor-Points Algorithms for Hamming and Edit Distances Using MapReduce

Authors

  • Foto N. Afrati
  • Anish Das Sarma
  • Anand Rajaraman
  • Pokey Rule
  • Semih Salihoglu
  • Jeffrey D. Ullman
Abstract

Algorithms for computing similarity joins in MapReduce were offered in [2]. Similarity joins ask to find input pairs that are within a certain distance d according to some distance measure. Here we explore the “anchor-points algorithm” of [2]. We continue looking at Hamming distance, and show that the method of that paper can be improved; in particular, if we want to find strings within Hamming distance d, and anchor points are chosen so that every possible input is within Hamming distance k of some anchor point, then it is sufficient to send each input to all anchor points within distance (d/2) + k, rather than d + k as was suggested in the earlier paper. This improves on the communication cost of the MapReduce algorithm, i.e., reduces the amount of data transmitted among machines. Further, the same holds for edit distance, provided inputs all have the same length n and either the length of all anchor points is n − k or the length of all anchor points is n + k. We then explore the problem of finding small sets of anchor points for edit distance, which also provides an improvement on the communication cost. We give a close-to-optimal technique to extend anchor sets (called “covering codes”) from the k = 1 case to any k. We then give small covering codes that use either a single deletion or a single insertion, or, in one algorithm, two deletions. Discovering covering codes for edit distance is important in its own right, since very little work on the problem is known.
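
The Hamming-distance idea in the abstract can be illustrated with a small single-machine simulation. This is only a sketch under stated assumptions, not the authors' implementation: the function names, the 6-bit example, and the anchor set (built from the length-3 repetition code {000, 111}, which covers every 3-bit string within distance 1) are all chosen here for illustration. The radius is taken as ceil(d/2) + k so that odd d is also handled safely; for even d this equals the (d/2) + k of the abstract.

```python
from collections import defaultdict
from itertools import combinations, product


def hamming(a, b):
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def similarity_join(inputs, anchors, d, k):
    """Simulate the anchor-points join for distinct, equal-length inputs.

    Assumes every possible input is within Hamming distance k of some anchor.
    """
    radius = (d + 1) // 2 + k  # ceil(d/2) + k (assumption: round up for odd d)

    # "Map" phase: send each input to every anchor within the radius.
    buckets = defaultdict(list)
    for s in inputs:
        for a in anchors:
            if hamming(s, a) <= radius:
                buckets[a].append(s)

    # "Reduce" phase: each anchor compares the inputs it received.
    result = set()
    for group in buckets.values():
        for s, t in combinations(group, 2):
            if hamming(s, t) <= d:
                result.add((min(s, t), max(s, t)))
    return result


def brute_force(inputs, d):
    """Reference answer: compare every pair directly."""
    return {(min(s, t), max(s, t))
            for s, t in combinations(inputs, 2) if hamming(s, t) <= d}


# Illustrative anchor set for 6-bit strings: concatenating two codewords of
# {000, 111} gives 4 anchors with covering radius k = 2.
anchors = [a + b for a, b in product(["000", "111"], repeat=2)]
k = 2

inputs = ["000000", "000011", "101011", "111111", "110100"]
pairs = similarity_join(inputs, anchors, d=2, k=k)
assert pairs == brute_force(inputs, d=2)
print(sorted(pairs))
```

The smaller radius works because any two strings within distance d have a "midpoint" string within ceil(d/2) of both, and that midpoint is within k of some anchor, so both strings reach that anchor. A pair may meet at more than one anchor, which is why the sketch collects results in a set.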

Similar resources

Anchor Points Algorithms for Hamming and Edit Distance

Algorithms for computing similarity joins in MapReduce were offered in [2]. Similarity joins ask to find input pairs that are within a certain distance d according to some distance measure. Here we explore the “anchor-points algorithm” of [2]. We continue looking at Hamming distance, and show that the method of that paper can be improved; in particular, if we want to find strings within Hamming...

Calculating Edit Distance for Large Sets of String Pairs using MapReduce

Given two strings X and Y over a finite alphabet, the edit distance between X and Y, d(X, Y), is the number of elementary edit operations required to edit X into Y. A dynamic programming algorithm elegantly computes this distance. In this paper, we investigate the parallelization of calculating edit distance for a large set of strings using MapReduce, a popular parallel computing framework. We...

On the hardness of maximum rank aggregation problems

The rank aggregation problem consists in finding a consensus ranking on a set of alternatives, based on the preferences of individual voters. The alternatives are expressed by permutations, whose pairwise distance can be measured in many ways. In this work we study a collection of distances, including the Kendall tau, Spearman footrule, Minkowski, Cayley, Hamming, Ulam, and related edit distanc...

Obliviously Approximating Sequence Distances

There are several applications for schemes which approximately find the distance between two sequences in a way that is ‘oblivious’ of one of the sequences up until a final sublinear number of comparisons. This paper shows how sequences can be preprocessed obliviously to give a binary string, so that a simple vector distance between two bitstrings gives an approximation to a sequence distance of inte...

Error Tree: A Tree Structure for Hamming & Edit Distances & Wildcards Matching

Error Tree is a novel tree structure that is mainly oriented to solve the approximate pattern matching problems, Hamming and edit distances, as well as the wildcards matching problem. The input is a text of length n over a fixed alphabet of length Σ, a pattern of length m, and k. The output is to find all positions that have ≤ k Hamming distance, edit distance, or wildcards matching with P. Th...

Publication date: 2014